214 research outputs found

    A Rule Based Taxonomy of Dirty Data

    There is a growing awareness that high quality of data is a key to today’s business success and that dirty data existing within data sources is one of the causes of poor data quality. To ensure high quality data, enterprises need to have a process, methodologies and resources to monitor, analyze and maintain the quality of data. Nevertheless, research shows that many enterprises do not pay adequate attention to the existence of dirty data and have not applied useful methodologies to ensure high quality data for their applications. One of the reasons is a lack of appreciation of the types and extent of dirty data. In practice, detecting and cleaning all the dirty data that exists in all data sources is quite expensive and unrealistic. The cost of cleaning dirty data needs to be considered for most enterprises. This problem has not attracted enough attention from researchers. In this paper, a rule-based taxonomy of dirty data is developed. The proposed taxonomy not only provides a mechanism to deal with this problem but also includes more dirty data types than any existing taxonomy of this kind.

    A Comparison of Techniques for Name Matching

    Information explosion is a problem for everyone nowadays. It is a great challenge for all kinds of businesses to maintain high quality of data in their information applications, such as data integration, text and web mining, information retrieval and search engines. In such applications, matching names is one of the popular tasks. There are a number of name matching techniques available. Unfortunately, no existing name matching technique performs best in all situations. Therefore, a problem that every researcher or practitioner has to face is how to select an appropriate technique for a given dataset. This paper analyses and evaluates a set of popular name matching techniques on several carefully designed datasets. The experimental comparison confirms the statement that there is no clear best technique. Some suggestions are presented, which can be used as guidance for researchers and practitioners to select an appropriate name matching technique for a given dataset.

    Supporting taxonomic names in cell and molecular biology databases.

    Groups of organisms require labels or names to refer to them; however, the idea of a single static name index, although tempting for its simplicity, is both impractical and unadvisable as a basis for referring to organisms for which data has been collected and stored for analyses and sharing. The relevant issues are described and some of the challenges facing database researchers are discussed.

    Visual Encodings for Networks with Multiple Edge Types

    This paper reports on a formal user study on visual encodings of networks with multiple edge types in adjacency matrices. Our tasks and conditions were inspired by real problems in computational biology. We focus on encodings in adjacency matrices, selecting four designs from a potentially huge design space of visual encodings. We then settle on three visual variables to evaluate in a crowdsourcing study with 159 participants: orientation, position and colour. The best encodings were integrated into a visual analytics tool for inferring dynamic Bayesian networks and evaluated by computational biologists for additional evidence. We found that the encodings performed differently depending on the task; however, colour was found to help in all tasks except when trying to find the edge with the largest number of edge types. Orientation generally outperformed position in all of our tasks.

    Extending taxonomic visualisation to incorporate synonymy and structural markers.

    The visualisation of taxonomic hierarchies has evolved from indented lists of names to techniques that can display thousands of nodes and on to hundreds of thousands of nodes over multiple taxonomies. However, challenges remain within multiple hierarchy visualisation, and for taxonomic hierarchy visualisation in particular. Firstly, at present, there is no support for handling specific taxonomic information such as synonymy, with current visualisations matching solely on names. Synonymy is extremely important as it reflects expert opinion on the compatibility of data held in separate taxonomies, and is needed to produce an accurate picture of taxonomic overlap. Also, current techniques for exploring large hierarchies find it difficult to convey internal reorganisations between hierarchies, with most systems showing only addition, removal or wide-ranging fragmentation of information between taxonomies. Finding the source of changes that have occurred within an existing structure is currently only achievable through exhaustive drill-down exploration. This paper describes work that tackles these problems, incorporating synonymy information into a model for multiple hierarchy visualisation of large taxonomies, and also detailing techniques that aid navigation for discovering structural reorganisations between hierarchies and for revealing information about nodes that lie below the effective display resolution of the hierarchy layout. Two examples on real taxonomic data sets are annotated to show the effectiveness of these techniques in operation.

    A survey of multiple tree visualisation.

    This paper summarises the state of the art in multiple tree visualisations. It discusses the spectrum of current representation techniques used on single trees, pairs of trees and finally multiple trees, in order to identify which representations are best suited to particular tasks and to find gaps in the representation space where opportunities for future multiple tree visualisation research may exist. The application areas from which multiple tree data are derived are enumerated, and the distinct structures that multiple trees make in combination with each other and the effect on subsequent approaches to their visualisation are discussed, along with the basic high-level goals of existing multiple tree visualisations.

    Exploring multiple trees through DAG representations

    We present a Directed Acyclic Graph visualisation designed to allow interaction with a set of multiple classification trees, specifically to find overlaps and differences between groups of trees and individual trees. The work is motivated by the need to find a representation for multiple trees that has the space-saving property of a general graph representation and the intuitive parent-child direction cues present in individual representations of trees. Using example taxonomic data sets, we describe augmentations to the common barycenter DAG layout method that reveal shared sets of child nodes between common parents in a clearer manner. Other interactions, such as displaying the multiple ancestor paths of a node when it occurs in several trees and revealing intersecting sibling sets within the context of a single DAG representation, are also discussed.

    Visual Exploration of Alternative Taxonomies through Concepts

    A graphical user interface is presented that allows users of taxonomic data to explore concept relationships between conflicting but related taxonomic classifications. Ecological analyses that use taxonomic metadata depend on accurate naming of specimens and taxa, and if the metadata involves several taxonomies, care has to be taken to match concepts between them. Performing this matching accurately requires expert-defined concept relationships, which are more complex yet more representative than the one-to-one mappings found through simple name matching, and can accommodate nomenclatural changes and differences in classification technique (cf. ‘lumpers’ versus ‘splitters’). In the SEEK-Taxon (Scientific Environment for Ecological Knowledge) project we aim to help users of taxonomic datasets untangle and understand these relationships through a prototype visual interface which graphically displays these relationship structures, allowing users to comprehend such information and more accurately name their data.